ABSTRACT

In this paper three models of parallel speedup are studied. They are fixed-size speedup,
fixed-time speedup and memory-bounded speedup. The latter two consider the relationship
between speedup and problem scalability. Two sets of speedup formulations are derived
for these three models. One set considers uneven workload allocation and communication
overhead, and gives more accurate estimation. Another set considers a simplified case and
provides a clear picture on the impact of the sequential portion of an application on the
possible performance gain from parallel processing. The simplified fixed-size speedup is
Amdahl’s law. The simplified fixed-time speedup is Gustafson’s scaled speedup. The simplified
memory-bounded speedup contains both Amdahl’s law and Gustafson’s scaled speedup as
special cases. This study leads to a better understanding of parallel processing.

1 Introduction


Although parallel processing has become a common approach for achieving high performance, there 
is no well-established metric to measure the performance gain of parallel processing. The most 
commonly used performance metric for parallel processing is speedup, which gives the performance 
gain of parallel processing versus sequential processing. Traditionally, speedup is defined as the 
ratio of uniprocessor execution time to execution time on a parallel processor. There are different 
ways to define the metric “execution time”. In fixed-size speedup, the amount of work to be executed 
is independent of the number of processors. Based on this model, Ware [17] summarized Amdahl’s 
[1] arguments to define a speedup formula which is known as Amdahl’s law. However, in many 
applications, the amount of work to be performed increases (as the number of processors increases) 
in order to obtain a more accurate or better result. The concept of scaled speedup was proposed by 
Gustafson et al. at Sandia National Laboratory [6]. Based on this concept, Gustafson suggested a
fixed-time speedup [5], which fixes the execution time and is interested in how the problem size can 
be scaled up. In scaled speedup, both sequential and parallel execution times are measured based 
on the same amount of work defined by the scaled problem. 

Both Amdahl’s law and Gustafson’s scaled speedup use a single parameter, the sequential 
portion of a parallel algorithm, to characterize an application. They are simple and give much 
insight into the potential degradation of parallelism as more processors become available. Amdahl’s 
law has a fixed problem size and is interested in how small the response time could be. It suggests 
that massively parallel processing may not gain high speedup. Gustafson [5] approaches the problem 
from another point of view. He fixes the response time and is interested in how large a problem 
could be solved within this time. This paper further investigates the scalability of problems. While 
Gustafson’s scalable problems are constrained by the execution time, the capacity of main memory 
is also a critical metric. For parallel computers, especially for distributed-memory multiprocessors, 
the size of scalable problems is often determined by the memory available. Shortage of memory is 
paid for in problem solution time (due to the I/O or message-passing delays) and in programmer
time (due to the additional coding required to multiplex the distributed memory) [3]. For many
applications, the amount of memory is an important constraint to scaling problem size [6, 10].
Thus, memory-bounded speedup is the major focus of this paper. 

We first study three models of speedup: fixed-size speedup, fixed-time speedup, and memory- 
bounded speedup. With both uneven workload allocation and communication overhead considered, 
speedup formulations will be derived for all three models. When communication overhead is not 
considered and the workload only consists of sequential and perfectly parallel portions, the simplified 
fixed-size speedup is Amdahl’s law; the simplified fixed-time speedup is Gustafson’s scaled speedup;
and the simplified memory-bounded speedup contains both Amdahl’s law and Gustafson’s speedup
as special cases. Therefore, the three models of speedup, which represent different points of view,
are unified.

Based on the concept of scaled speedup, intensive research has been conducted in recent years 
in the area of performance evaluation. Some other definitions of speedup have also been proposed, 
such as generalized speedup, cost-related speedup, and superlinear speedup. Interested readers can 
refer to [14, 9, 16, 7, 18, 2, 8] for details. 

This paper is organized as follows. In Section 2 we introduce the program model and some 
basic terminologies. More generalized speedup formulations for the three models of speedup are 
presented in Section 3. Speedup formulations for simplified cases are studied in Section 4. The 
influence of communication/memory tradeoff is studied in Section 5. Conclusions and comments
are given in Section 6.

2 A Model of Parallel Speedup 

To measure different speedup metrics for scalable problems, the underlying machine is assumed to 
be a scalable multiprocessor. A multiprocessor is considered scalable if, as the number of processors
increases, the memory capacity and network bandwidth also increase. Furthermore, all processors
are assumed to be homogeneous. Most distributed-memory multiprocessors and multicomputers, 
such as commercial hypercube and mesh-connected computers, are scalable multiprocessors. Both 
message-passing and shared-memory programming paradigms have been used in such multiproces- 
sors. To simplify the discussion, our study assumes homogeneous distributed-memory architectures. 

The parallelism in an application can be characterized in different ways for different purposes 
[15]. For simplicity, speedup formulations generally use very few parameters and consider very high 
level characterizations of the parallelism. We consider two main degradations of parallelism, uneven 
allocation (load imbalance) and communication latency. The former degradation is application 
dependent. The latter degradation depends on both the application and the parallel computer 
under consideration. To obtain an accurate estimate, both degradations need to be considered. 
Uneven allocation is measured by degree of parallelism. 

Definition 1 The degree of parallelism of a program is an integer which indicates the maximum
number of processors that can be busy computing at a particular instant in time, given an unbounded
number of available processors.

The degree of parallelism is a function of time. By drawing the degree of parallelism over the 
execution time of an application, a graph can be obtained. We refer to this graph as the parallelism
profile. Figure 1 is the parallelism profile of a hypothetical divide-and-conquer computation [13].
By accumulating the time spent at each degree of parallelism, the profile can be rearranged to form
the shape (see Figure 2) of the application [12].


Figure 1. Parallelism profile of an application (degree of parallelism, from 0 to 4, plotted against
time, from 0 to T).
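
The rearrangement from profile to shape described above can be sketched in a few lines of Python.
The profile is represented here as a list of (duration, degree) segments, a representation assumed
for illustration; the segment values are invented and are not read from Figure 1.

```python
from collections import defaultdict

def shape_from_profile(profile):
    """Accumulate the time spent at each degree of parallelism.

    profile: list of (duration, degree) segments in execution order.
    Returns a dict mapping degree -> total time spent at that degree.
    """
    shape = defaultdict(float)
    for duration, degree in profile:
        shape[degree] += duration
    return dict(sorted(shape.items()))

# Hypothetical profile: (duration, degree of parallelism) per segment.
profile = [(1.0, 1), (2.0, 2), (1.0, 4), (2.0, 2), (1.0, 1)]
shape = shape_from_profile(profile)
print(shape)
```

The order of the segments is discarded; only the total time per degree survives, which is what
distinguishes the shape from the profile.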

Let W be the amount of work of an application. Work can be defined as arithmetic operations, 
instructions, or whatever is needed to complete the application. Formally, the speedup with N 
processors and with the total amount of work W is defined as

$$S_N(W) = \frac{T_1(W)}{T_N(W)} \tag{1}$$

where $T_i(W)$ is the time required to complete W amount of work on i processors. Let $W_i$ be
the amount of work executed with degree of parallelism i, and let m be the maximum degree of
parallelism. Thus, $W = \sum_{i=1}^{m} W_i$. Assuming each computation takes a constant time to finish on
a given processor, the execution time for computing $W_i$ with a single processor is

$$t_1(W_i) = \frac{W_i}{\Delta} \tag{2}$$

where $\Delta$ is the computing capacity of each processor. If there are i processors available, the
execution time is

$$t_i(W_i) = \frac{W_i}{i\Delta}.$$

With an infinite number of processors available, the execution time will not be further decreased
and is

$$t_\infty(W_i) = \frac{W_i}{i\Delta} \quad \text{for } 1 \le i \le m.$$

Figure 2. Shape of the application.


Therefore, without considering communication latency, the execution times on a single processor
and on an infinite number of processors are

$$T_1(W) = \sum_{i=1}^{m} \frac{W_i}{\Delta}, \tag{3}$$

$$T_\infty(W) = \sum_{i=1}^{m} \frac{W_i}{i\Delta}. \tag{4}$$

The maximum speedup, with work W and an infinite number of processors, is

$$S_\infty(W) = \frac{T_1(W)}{T_\infty(W)} = \frac{\sum_{i=1}^{m} W_i}{\sum_{i=1}^{m} W_i / i}. \tag{5}$$
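
Eq. (5) can be evaluated directly once the shape of an application is known. The sketch below
assumes the shape is given as a dictionary mapping each degree of parallelism i to the work done
at that degree; the workload values are invented for illustration.

```python
def max_speedup(work_by_degree):
    """Eq. (5): sum(W_i) / sum(W_i / i); the processor capacity cancels."""
    total = sum(work_by_degree.values())
    spread = sum(w / i for i, w in work_by_degree.items())
    return total / spread

# Hypothetical shape: degree of parallelism -> amount of work.
work = {1: 2.0, 2: 8.0, 4: 4.0}
s_inf = max_speedup(work)
print(s_inf)
```

The result is a work-weighted average of the degrees of parallelism, so it can never exceed the
maximum degree m.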


Average parallelism is an important factor for speedup and efficiency. It has been carefully
examined in [4]. Average parallelism is equivalent to the maximum speedup $S_\infty$ [4, 15]. $S_\infty$
gives the best possible speedup based on the inherent parallelism of an algorithm. There are no
machine dependent factors considered. With only a limited number of available processors and
with communication latency considered, the speedup will be less than the best speedup, $S_\infty(W)$.
If there are N processors available and N < i, then some processors have to do
$\frac{W_i}{i} \lceil \frac{i}{N} \rceil$ work and the rest of the processors will do
$\frac{W_i}{i} \lfloor \frac{i}{N} \rfloor$ work. By the definition of degree of parallelism, $W_i$ and
$W_j$ cannot be executed simultaneously for $i \ne j$. Thus, the elapsed time will be

$$t_N(W_i) = \frac{W_i}{i\Delta} \left\lceil \frac{i}{N} \right\rceil.$$

Hence,

$$T_N(W) = \sum_{i=1}^{m} \frac{W_i}{i\Delta} \left\lceil \frac{i}{N} \right\rceil, \tag{6}$$

and the speedup is

$$S_N(W) = \frac{T_1(W)}{T_N(W)} = \frac{\sum_{i=1}^{m} W_i}{\sum_{i=1}^{m} \frac{W_i}{i} \left\lceil \frac{i}{N} \right\rceil}. \tag{7}$$
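
A minimal sketch of Eq. (7) follows, again assuming the shape is supplied as a dictionary from
degree of parallelism to work (invented values). It shows how the ceiling term penalizes degrees
of parallelism that exceed the number of processors.

```python
import math

def fixed_size_speedup(work_by_degree, n_procs):
    """Eq. (7): sum(W_i) / sum((W_i / i) * ceil(i / N))."""
    total = sum(work_by_degree.values())
    elapsed = sum((w / i) * math.ceil(i / n_procs)
                  for i, w in work_by_degree.items())
    return total / elapsed

# Hypothetical shape: degree of parallelism -> amount of work.
work = {1: 2.0, 2: 8.0, 4: 4.0}
for n in (1, 2, 4, 8):
    print(n, fixed_size_speedup(work, n))
```

Once N reaches the maximum degree of parallelism (4 in this example), adding processors no
longer changes the result, which then coincides with the maximum speedup of Eq. (5).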

Communication latency is another factor causing performance degradation. Unlike degree of
parallelism, communication latency is machine dependent. It depends on the communication
network topology, the routing scheme, the adopted switching technique, and the dynamics of the
network traffic. Let $Q_N(W)$ be the communication overhead when N processors are used to
complete W amount of work. The actual formulation for $Q_N(W)$ is difficult to derive, as it is dependent
on the communication pattern and the message sizes of the algorithm itself, as well as the
system-dependent communication latency. Note that $Q_N(W)$ is encountered when there are N processors
(N > 1). Assuming that the degree of parallelism does not change due to communication overhead,
the speedup becomes

$$S_N(W) = \frac{T_1(W)}{T_N(W)} = \frac{\sum_{i=1}^{m} W_i}{\sum_{i=1}^{m} \frac{W_i}{i} \left\lceil \frac{i}{N} \right\rceil + Q_N(W)}. \tag{8}$$

3 Speedup of Scaled Problems


In the last section we developed a general speedup formula and showed how the number of processors 
and degradation parameters influence the performance. However, speedup is not dependent only 
on these parameters. It is also dependent on how we view the problem. With different points 
of view, we get different models of speedup and different speedup formulations. One viewpoint 
emphasizes shortening the time it takes to solve a problem by parallel processing. With more and 
more computation power available, the problem can, in principle, be solved in less and less time. 
With more processors available, the system will provide a fast turnaround time and the user will 
have a shorter waiting time. A speedup formulation based on this philosophy is called fixed-size 
speedup. In the previous section, we implicitly adopted fixed-size speedup. Eq. (8) is the speedup 
formula for fixed-size speedup. Fixed-size speedup is suitable for many algorithms in which the
problem size cannot be scaled.

For some applications we may have a time limitation, but we may not want to obtain the solution 
in the shortest possible time. If we have more computation power, we may want to increase the 
problem size, carry out more operations, and get a more accurate solution. Various finite difference 
and finite element algorithms for the solution of partial differential equations (PDEs) are typical
examples of such scalable problems. 

An important issue in scalable problems is the identification of scalability constraints. One
scalability constraint is to keep the execution time unchanged with respect to uniprocessor execution
time. This viewpoint leads to a different model of speedup, called fixed-time speedup. For fixed-time
speedup the workload is scaled up with the number of processors available. Let $W' = \sum_{i=1}^{m'} W_i'$ be
the total amount of scaled work, where $W_i'$ is the amount of scaled work executed with degree of
parallelism i, and $m'$ be the maximum degree of parallelism of the scaled problem when N processors
are available. Note that the maximum degree of parallelism can change as the problem is scaled. In
order to keep the same turnaround time as the sequential version, the condition $T_1(W) = T_N(W')$
must be satisfied for $W'$. That is, the following scalable constraint must be satisfied,

$$\sum_{i=1}^{m} W_i = \sum_{i=1}^{m'} \frac{W_i'}{i} \left\lceil \frac{i}{N} \right\rceil + Q_N(W'). \tag{9}$$

Thus, the general speedup formula for fixed-time speedup is

$$S_N'(W') = \frac{T_1(W')}{T_N(W')} = \frac{\sum_{i=1}^{m'} W_i'}{\sum_{i=1}^{m'} \frac{W_i'}{i} \left\lceil \frac{i}{N} \right\rceil + Q_N(W')}. \tag{10}$$

In many parallel computers, the memory size plays an important role in performance. Many 
large scale multiprocessors with local memory architecture do not support virtual memory due to 
insufficient I/O network bandwidth. When solving an application with one processor, the problem 
size is more often bounded by the memory limitation than by the execution time limitation. With
more processors available, instead of keeping the execution time fixed, we may want to meet the 
memory size constraint. In other words, if you have adequate memory space and the scaled problem 
meets the time limit imposed by fixed-time speedup, will you further increase the problem size to 
yield an even better or more accurate solution? If the answer is yes, the appropriate model is 
memory-bounded speedup. Like fixed-time speedup, memory-bounded speedup is a scaled speedup. 
The problem size scales up with memory size. The difference is that in fixed-time speedup execution 
time is the limiting factor and in memory-bounded speedup memory size is the limiting factor. 

With memory size considered as a factor of performance, the requirements of an algorithm
consist of two parts. One is the computation requirement, which is the workload, and the other
is the memory (capacity) requirement. For a given algorithm, these two requirements are related
to each other, and the workload might be viewed as a function of the memory requirement. Let
M represent the memory size of each processor. Let g be a function such that $W = g(M)$, or
$M = g^{-1}(W)$, where $g^{-1}$ is the inverse function of g. An example of function g and $g^{-1}$ can
be found in Section 5. In a homogeneous, scalable, parallel computer, the memory capacity on
each node is fixed and the total memory available increases linearly with the number of processors
available. If $W = \sum_{i=1}^{m} W_i$ is the workload for execution on a single processor, the maximum scaled
workload with N processors, $W^* = \sum_{i=1}^{m^*} W_i^*$, must satisfy the following scalable constraint,

$$W^* = g(NM) = g(N g^{-1}(W)), \tag{11}$$

where $m^*$ is the maximum degree of parallelism of the scaled problem and g is determined by
the algorithm. The memory limitation can be stated as: the memory requirement for any active
processor is less than or equal to $M = g^{-1}(\sum_{i=1}^{m} W_i)$. Here the main point is that the memory
occupied on each processor is limited. By considering the communication overhead, Eq. (12) is the
general speedup formula for memory-bounded speedup.

$$S_N^*(W^*) = \frac{\sum_{i=1}^{m^*} W_i^*}{\sum_{i=1}^{m^*} \frac{W_i^*}{i} \left\lceil \frac{i}{N} \right\rceil + Q_N(W^*)} \tag{12}$$
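
The scalable constraint of Eq. (11) can be illustrated with a short sketch. The relation
g(M) = M^1.5 used here is an assumption chosen only to make the example concrete; the function
names are hypothetical.

```python
def scaled_workload(work, n_procs, g, g_inv):
    """Eq. (11): W* = g(N * g_inv(W))."""
    memory_per_node = g_inv(work)          # M = g^{-1}(W)
    return g(n_procs * memory_per_node)    # W* = g(N M)

# Assumed relation for illustration only: W = g(M) = M ** 1.5.
def g(m):
    return m ** 1.5

def g_inv(w):
    return w ** (2.0 / 3.0)

w_star = scaled_workload(1000.0, 4, g, g_inv)
print(w_star)
```

With this choice of g, four times the memory supports roughly eight times the work, so the
workload grows faster than the memory that bounds it.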


4 Simplified Models of Speedup 

The three general speedup formulations contain both uneven allocation and communication latency 
degradations. They give better upper bounds on the performance of parallel applications. On 
the other hand, these formulations are problem dependent and difficult to understand. They give 
detailed information for each application, but lose the global view of possible performance gains. In 
this section, we make some simplifying assumptions. We assume that the communication overhead 
is negligible, i.e., $Q_N = 0$, and the workload only contains two parts, a sequential part and a
perfectly parallel part. That is, $W_i = 0$ for $i \ne 1$ and $i \ne N$. We also assume that the sequential
part is independent of the system size, i.e., $W_1 = W_1' = W_1^*$.

Under this simplified case, the general fixed-size speedup formulation (Eq. 8) becomes

$$S_N(W) = \frac{W_1 + W_N}{W_1 + \frac{W_N}{N}}. \tag{13}$$


Eq. (13) is known as Amdahl’s law. Figure 3 shows that when the number of processors increases
the load on each processor decreases. Eventually, the sequential part will dominate the performance
and the speedup is bounded by $\frac{W_1 + W_N}{W_1}$. In Figure 3, $T_1$ is the execution time for the sequential
portion of the work, and $T_N$ is the execution time for the parallel portion of the work.

Figure 3. Amdahl’s law (two panels, each plotted against the number of processors N).
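
The bound on the fixed-size speedup follows by letting N grow in Eq. (13). The numeric case
below, with the sequential part taken as 30% of the work, is an illustrative assumption.

```latex
\[
\lim_{N \to \infty} S_N(W)
  = \lim_{N \to \infty} \frac{W_1 + W_N}{W_1 + W_N / N}
  = \frac{W_1 + W_N}{W_1},
\qquad
\text{e.g. } W_1 = 0.3,\; W_N = 0.7
  \;\Rightarrow\; S_N(W) < \frac{1}{0.3} \approx 3.33 .
\]
```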

For fixed-time speedup and under the simplified conditions, the scalability constraint (Eq. 9)
becomes

$$W_1 + W_N = W_1' + \frac{W_N'}{N}. \tag{14}$$

Since $W_1 = W_1'$, we have $W_N = W_N'/N$. That is, $W_N' = N W_N$. Eq. (10) becomes

$$S_N'(W') = \frac{W_1 + N W_N}{W_1 + W_N}. \tag{15}$$


The simplified fixed-time speedup (Eq. 15) is known as Gustafson’s scaled speedup [5]. From 
Eq. (15) we can see that the parallel portion of an application scales up linearly with the system 
size. The relation of workload and elapsed time for Gustafson’s scaled speedup is depicted in Figure 
4. 
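
The fixed-time scaling can be sketched as follows. The workload values and the function name are
hypothetical; the inline check confirms that the scaled problem on N processors takes as long as
the original problem on one processor.

```python
def fixed_time_scaling(w1, wn, n_procs):
    """Scale the parallel part per Eq. (14) and return (scaled W_N, speedup)."""
    wn_scaled = n_procs * wn                   # scaled parallel work is N * W_N
    # Time of the scaled problem on N processors equals the original
    # one-processor time (times are in units of work per unit capacity).
    assert abs((w1 + wn_scaled / n_procs) - (w1 + wn)) < 1e-9
    speedup = (w1 + wn_scaled) / (w1 + wn)     # Eq. (15)
    return wn_scaled, speedup

wn_scaled, s = fixed_time_scaling(w1=30.0, wn=70.0, n_procs=16)
print(wn_scaled, s)
```

Only the parallel part grows with N, which is why the speedup is linear in the number of
processors.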


Figure 4. Gustafson’s scaled speedup.

We need some preparation before deriving the simplified formulation for memory-bounded
speedup.

Definition 2 A function g is a semihomomorphism if there exists a function $\bar{g}$ such that for any
real number c and any variable x, $g(cx) = \bar{g}(c) g(x)$.

One class of semihomomorphisms is the power function $g(x) = x^b$, where b is a rational number.
In this case, $\bar{g}$ is the same as the function g. Another class of semihomomorphisms is the single
term polynomial $g(x) = a x^b$, where a is a real constant and b is a rational number. For this kind
of semihomomorphism, $\bar{g}(x) = x^b$, which is not the same as g(x).
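
The defining identity of a semihomomorphism can be checked numerically for a single term
polynomial. The constants a = 2 and b = 1.5 and the sample points are arbitrary choices made for
this sketch.

```python
import math

def g(x, a=2.0, b=1.5):
    """Single term polynomial g(x) = a * x**b (a and b chosen arbitrarily)."""
    return a * x ** b

def g_bar(c, b=1.5):
    """Companion function from Definition 2 for this g: c**b."""
    return c ** b

# Verify g(c * x) == g_bar(c) * g(x) on a few sample points.
checks = [math.isclose(g(c * x), g_bar(c) * g(x))
          for c in (0.5, 2.0, 7.0) for x in (1.0, 3.0, 10.0)]
print(all(checks))
```

The constant a appears in g but not in the companion function, which is why the two differ for
this class.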

Under our assumptions, the sequential portion of the workload, W\ , is independent of the system 
size. If the influence of memory on the sequential portion is not considered, i.e., the memory capacity 
M is used for the parallel portion only, we have the following theorem. 


Theorem 1 If W = g(M) for some semihomomorphism g, $g(cx) = \bar{g}(c) g(x)$, then, with all data
being accessible by all available processors and using all available memory space, the simplified
memory-bounded speedup is

$$S_N^*(W^*) = \frac{W_1 + \bar{g}(N) W_N}{W_1 + \frac{\bar{g}(N)}{N} W_N}. \tag{16}$$


Proof: Assume that the maximum problem size will take the maximum available memory
capacity of M when one processor is used. As mentioned before, when one processor is available,
the parallel portion of the workload, $W_N$, can be expressed as $W_N = g(M)$. Since all data are
accessible by all processors, there is no need to replicate the data. With N processors available,
the total available memory capacity will be increased to NM. The parallel portion of the problem
can be scaled up to use all available memory capacity NM. Thus, the scaled parallel portion, $W_N^*$,
is expressed as $W_N^* = g(NM) = \bar{g}(N) g(M)$. Therefore, $W_N^* = \bar{g}(N) W_N$ and

$$S_N^*(W^*) = \frac{W_1^* + W_N^*}{W_1^* + \frac{W_N^*}{N}} = \frac{W_1 + \bar{g}(N) W_N}{W_1 + \frac{\bar{g}(N)}{N} W_N}. \tag{17}$$

□


Note that in Theorem 1, we made two assumptions in the simplified case: 1) Since the
communication latency is ignored, remote memory accesses take the same time as local memory accesses.
This implies that the data is accessible by all available processors, and 2) All the available memory
space is used for a better solution. These simplified speedup models are useful to demonstrate how
the sequential portion of an application, $W_1$, will affect the maximum speedup that can be achieved
with different numbers of processors. Let $k = \frac{W_1}{W_1 + W_N}$. The simplified fixed-size speedup, fixed-time
speedup, and memory-bounded speedup are, respectively,

$$S_N(W) = \frac{N}{1 + k(N - 1)}, \tag{18}$$

$$S_N'(W') = N - k(N - 1) = k + N(1 - k), \text{ and} \tag{19}$$

$$S_N^*(W^*) = N \, \frac{\bar{g}(N) + k(1 - \bar{g}(N))}{\bar{g}(N) + k(N - \bar{g}(N))}. \tag{20}$$
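
The three simplified formulas are easy to compare side by side. In the sketch below the function
names are hypothetical, and the scaling function passed to the memory-bounded case is supplied by
the caller.

```python
def fixed_size(k, n):
    """Eq. (18), Amdahl's law, with sequential fraction k."""
    return n / (1 + k * (n - 1))

def fixed_time(k, n):
    """Eq. (19), Gustafson's scaled speedup."""
    return k + n * (1 - k)

def memory_bounded(k, n, g_bar):
    """Eq. (20) for a caller-supplied scaling function g_bar."""
    gn = g_bar(n)
    return n * (gn + k * (1 - gn)) / (gn + k * (n - gn))

k, n = 0.3, 64
print(fixed_size(k, n), fixed_time(k, n),
      memory_bounded(k, n, lambda x: x ** 1.5))
```

Passing a scaling function that always returns 1 reproduces Eq. (18), and passing the identity
reproduces Eq. (19).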


When the number of processors, N, goes to infinity, Eq. (18) is bounded by the reciprocal of
k, which gives the maximum value of the fixed-size speedup. Eq. (19) shows that the fixed-time
speedup is a linear function of the number of processors with slope equal to (1 - k). When N
goes to infinity, this speedup can increase without bound. Memory-bounded speedup depends on
the function $\bar{g}(N)$. When $\bar{g}(N) = 1$, memory-bounded speedup is the same as fixed-size speedup.
When $\bar{g}(N) = N$, the memory-bounded speedup is the same as the fixed-time speedup. In general,
the function $\bar{g}(N)$ is application dependent and $\bar{g}(N) > N$. It implies that when the problem size
is increased by a factor of N, the amount of work increases more than N times. It is easy to verify that
$S_N^*(W^*) > S_N'(W')$ when $\bar{g}(N) > N$. Note that all data in memory is likely to be accessed at least
once. Thus, for scaled problems, $\bar{g}(N) < N$ is unlikely to occur. The sequential portion of the
work plays different roles in the three definitions of speedup. In fixed-size speedup, the influence
of the sequential portion increases with system size and eventually dominates the performance. In
fixed-time speedup, the influence of the sequential portion is unchanged which makes the speedup
a linear function of system size. In the memory-bounded speedup, since in general $\bar{g}(N) > N$,
the influence of the sequential portion is reduced when the system size increases, indicating that a
better speedup could be achieved with a larger system size.

The function $\bar{g}(N)$ provides a metric to evaluate parallel algorithms. In general, $\bar{g}(N)$ may not
be derivable for a given algorithm. Note that any single term polynomial is a semihomomorphism, 
and most solvable algorithms have polynomial time computation and memory requirement. If we 
take an algorithm’s computation and storage complexity (the term with the largest power) as its 
computation and memory requirement, for any algorithm with polynomial complexity there exists a 
semihomomorphism g, such that W = g(M). The approximated semihomomorphism g will provide 
a good estimation on the memory-bounded speedup when the number of processors is large. More 
detailed case studies for the three models of speedup can be found in [13]. 
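
The earlier statement that memory-bounded speedup is at least the fixed-time speedup whenever
the scaling factor is at least N can be checked by treating that factor as a free variable in Eq. (20).
The symbol y and the function f below are introduced only for this sketch.

```latex
\[
f(y) = N\,\frac{y + k(1 - y)}{y + k(N - y)}
     = N\,\frac{k + (1-k)\,y}{kN + (1-k)\,y},
\qquad
f'(y) = \frac{N\,k\,(1-k)\,(N-1)}{\bigl(kN + (1-k)\,y\bigr)^{2}} \;\ge\; 0
\quad (0 \le k \le 1,\; N \ge 1).
\]
```

Since f(N) = N - k(N - 1) is the fixed-time speedup of Eq. (19) and f is nondecreasing in y, it
follows that f(y) is at least f(N) whenever y is at least N.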

Figure 5 demonstrates the difference between the three models of speedup when k = 0.3 and N 
ranges from 1 to 1024. For the simplified memory-bounded (SMB) speedup, we choose $\bar{g}(N) = N^{3/2}$,
which is typical in many matrix operations to be described later. When $\bar{g}(N) = N$, it is Gustafson’s
scaled speedup. The case of G(N) = (1 + £[1 - will be studied in next section. 



Figure 5. Amdahl’s law, Gustafson’s speedup, and SMB speedup for k = 0.3. 
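
As a rough numerical companion to the figure, the endpoint N = 1024 with k = 0.3 can be
evaluated from Eqs. (18) to (20), taking the scaling function as N^{3/2}. These values are
computed from the formulas and are not read off the plot.

```latex
\[
\begin{aligned}
S_N(W)     &= \frac{1024}{1 + 0.3 \cdot 1023} \approx 3.33, \\
S_N'(W')   &= 0.3 + 0.7 \cdot 1024 = 717.1, \\
S_N^*(W^*) &= 1024 \cdot \frac{0.3 + 0.7 \cdot 1024^{3/2}}{0.3 \cdot 1024 + 0.7 \cdot 1024^{3/2}} \approx 1010.5 .
\end{aligned}
\]
```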


5 Communication-Memory Tradeoff 

The simplified speedup formulations give the impact of the sequential portion of an application 
on the maximum speedup. The simplified memory-bounded speedup suggests that when data are 
shared by all processors, maximum speedup is obtained. However, in practice if communication
overhead is considered, the data sharing approach may not lead to maximum speedup. In the
design of efficient parallel algorithms, the communication cost plays an important role in deciding
how a problem should be solved and scaled. One way to reduce the frequency of communication
is to replicate some shared data to processors. Thus, a good algorithm design should consider the
tradeoff between the maximum size that a problem can scale and the reduction of available memory
due to the replication of shared data.

If data replication is allowed, the relation $W^* = g(NM)$ will no longer hold. Motivated by
Theorem 1, the function $G(N) = W_N^* / W_N$ is defined to represent the ratio of work increment
when N processors are available. In terms of G(N), the simplified memory-bounded speedup is
generalized below.

Theorem 2 If $W_1$ is independent of system size, $W_i = 0$ for 1 < i < N, and $W_N^* = G(N) W_N$ for
some function G(N), the memory-bounded speedup is

$$S_N^*(W^*) = \frac{W_1 + G(N) W_N}{W_1 + \frac{G(N)}{N} W_N + Q_N(W^*)}. \tag{21}$$


The proof of Theorem 2 is similar to the proof of Theorem 1. Eq. (21) shows that the
maximum speedup is not necessarily achieved when $G(N) = \bar{g}(N)$. Note that the communication cost
$Q_N(W^*)$ is a unified communication cost. An optimal choice of the function G(N) is both
algorithm and architecture dependent and, in general, is difficult to obtain. Also, unlike $\bar{g}(N)$, G(N)
might be less than N. If G(N) < N, memory capacity is likely to be the scalable constraint when
N is large. If G(N) > N, execution time is likely to be the scalable constraint. The function G(N)
indicates the possible scalable constraint of an algorithm. The proposed scaled speedup (Eq. 21)
may not be easy to fully understand at first glance. Hence, we use matrix multiplication as an
example to illustrate it.
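
Before turning to that example, the tradeoff expressed by Eq. (21) can be seen in a small
numerical sketch. The workloads, the two choices of G(N), and in particular the overhead value
are made-up numbers chosen for illustration, not measurements.

```python
def memory_bounded_speedup(w1, wn, n, g_of_n, overhead=0.0):
    """Eq. (21): (W1 + G*WN) / (W1 + (G/N)*WN + Q)."""
    return (w1 + g_of_n * wn) / (w1 + g_of_n * wn / n + overhead)

w1, wn, n = 30.0, 70.0, 16

# Full sharing: G(N) = N**1.5, charged an assumed (made-up) overhead.
shared = memory_bounded_speedup(w1, wn, n, n ** 1.5, overhead=200.0)
# Full replication: G(N) = N and no communication at all.
replicated = memory_bounded_speedup(w1, wn, n, float(n))
print(shared, replicated)
```

With this overhead the replicated variant comes out ahead even though it solves a smaller
problem; with zero overhead the ordering reverses.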

A matrix often represents some discretized continuum. Enlarging the matrix size will generally 
lead to a more accurate solution for the continuum. For matrix multiplication C = AB , there are 
many ways to partition the matrices A and B to allow parallel processing [11]. Assume that there 
are N processors available, and A and B are n x n matrices when executing on a single processor. 
The computation requirement is $2n^3$ and the memory requirement is roughly $3n^2$. Thus, $W_N = 2n^3$
and $M = 3n^2$. Two extreme cases of memory-bounded scaled speedup are considered.


Local Computation 

In the first case, we assume that the communication cost is extremely high. Thus, data should be 
replicated if possible to reduce communication. This can be achieved by partitioning the columns
of matrix B into N submatrices, $B_0, B_1, \ldots, B_{N-1}$, and replicating the matrix A. Thus, the $B_i$’s are
distributed among all the processors and matrix A is replicated on each processor. Processor i
does the multiplication $A B_i = C_i$, $i = 0, \ldots, N - 1$, independently. Since there is no need for
communication, it is referred to as the local computation approach. Figure 6(a) shows the partitioning
of B for the case of N = 4.
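
The local computation scheme can be simulated sequentially in plain Python. The sketch below
assumes that N divides the number of columns of B; the helper names and the test matrices are
hypothetical.

```python
def matmul(a, b):
    """Plain triple-loop product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def local_computation(a, b, n_procs):
    """Each simulated processor p holds all of A and one column block of B."""
    width = len(b[0]) // n_procs                 # assumes N divides the column count
    blocks = []
    for p in range(n_procs):
        b_p = [row[p * width:(p + 1) * width] for row in b]
        blocks.append(matmul(a, b_p))            # C_p = A * B_p, no communication
    # Concatenate the column blocks C_0, ..., C_{N-1} row by row.
    return [sum((blk[i] for blk in blocks), []) for i in range(len(a))]

a = [[1, 2, 0, 1], [0, 1, 3, 2], [2, 0, 1, 1], [1, 1, 1, 0]]
b = [[1, 0, 2, 1], [3, 1, 0, 2], [0, 2, 1, 1], [1, 1, 0, 3]]
c = local_computation(a, b, n_procs=4)
print(c == matmul(a, b))
```

Every block product reads the whole of A, which is the memory cost that replication trades for
the absence of communication.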


Figure 6. Two partitioning schemes of matrices A and B: (a) the matrix B is partitioned; (b) both
matrices A and B are partitioned.

If both A and B are allowed to scale along any dimension and A and B need not be
square matrices, the enlarged problem is $A^* B^* = C^*$, where $A^*$ is an l x k matrix, $B^*$ is a k x m
matrix, and the resulting matrix $C^*$ is an l x m matrix. Note that the local memory capacity
is $M = 3n^2$. It is easy to see that the maximum memory-bounded speedup will be achieved when
l = k = n, and m = nN. In other words, both B and C are scaled up N times along their rows, and
A is replicated but not scaled. The amount of computation on each processor is fixed, $W_N = 2n^3$,
and $W_N^* = N W_N$. Thus, we have G(N) = N. The memory-bounded scaled speedup is

$$S_N^*(W^*) = \frac{W_1 + N W_N}{W_1 + W_N},$$

which is Gustafson’s scaled speedup. Thus, the best performance of memory-bounded speedup 
using the local computation model is the same as the Gustafson’s scaled speedup. In general, the 
local computation model will lead to a speedup that is less than Gustafson’s scaled speedup. For 
example, if both A and B are restricted to square matrices, the function G(N) will be

$$G(N) = \left( \frac{3N}{N + 2} \right)^{3/2},$$

which is less than N, for N > 1, and is bounded by $3^{3/2}$ (see Appendix). Note that due to data
replication, the memory capacity requirement increases faster than the computation requirement
does.
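
One way to arrive at this expression is sketched below. It assumes that each processor stores the
whole of the scaled matrix A together with its 1/N share of B and of C; the scaled order n' is
notation introduced only for this sketch.

```latex
\[
\underbrace{n'^2}_{A} + \underbrace{\frac{n'^2}{N}}_{B_i} + \underbrace{\frac{n'^2}{N}}_{C_i} = 3n^2
\;\Longrightarrow\;
n'^2 = \frac{3N}{N+2}\, n^2,
\qquad
G(N) = \frac{2 n'^3}{2 n^3} = \left( \frac{3N}{N+2} \right)^{3/2} .
\]
```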


Global Computation 

In the second extreme case, we assume that the communication cost is negligible. Thus, there is 
no need to replicate the data. A bigger problem can be solved. We partition matrix A into N row 
blocks and B into N column blocks (see Figure 6(b)). By assigning each pair of submatrices, $A_i$ and
$B_i$, to one processor initially, all main diagonal blocks of C can be computed. Then, the row blocks
of A are rotated from one processor to another after each row-column submatrix multiplication.
With N processors, N - 1 rotations are needed to finish the computation as shown in Figure 7 for
the case of N = 4. This method is referred to as global computation.
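
The rotation scheme can be simulated sequentially as well. The sketch assumes square matrices
whose order is divisible by N and picks one rotation direction arbitrarily; the helper names and
the test matrices are hypothetical.

```python
def block_product(a_rows, b_cols):
    """Product of a row block (list of rows) and a column block (list of columns)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols]
            for row in a_rows]

def global_computation(a, b, n_procs):
    n = len(a)
    size = n // n_procs                                   # assumes N divides n
    a_blocks = [a[p * size:(p + 1) * size] for p in range(n_procs)]
    cols = [list(col) for col in zip(*b)]                 # columns of B
    b_blocks = [cols[p * size:(p + 1) * size] for p in range(n_procs)]
    c = [[0] * n for _ in range(n)]
    for step in range(n_procs):                           # N products, N - 1 rotations
        for p in range(n_procs):
            q = (p + step) % n_procs                      # row block currently held by p
            blk = block_product(a_blocks[q], b_blocks[p])
            for i in range(size):
                for j in range(size):
                    c[q * size + i][p * size + j] = blk[i][j]
    return c

a = [[1, 2, 0, 1], [0, 1, 3, 2], [2, 0, 1, 1], [1, 1, 1, 0]]
b = [[1, 0, 2, 1], [3, 1, 0, 2], [0, 2, 1, 1], [1, 1, 0, 3]]
c = global_computation(a, b, n_procs=4)
print(c)
```

At step 0 every simulated processor computes a main diagonal block of C; each later step
corresponds to one rotation of the row blocks of A.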

For the global computation approach, the maximum scaled speedup is achieved when l = k =
m = $n\sqrt{N}$ (see Appendix).

$$S_N^*(W^*) = \frac{W_1 + N^{3/2} W_N}{W_1 + N^{1/2} W_N} \tag{22}$$

The corresponding function $G(N) = N^{3/2}$. Assuming $N < n^2$, we can write $W_N$ as a function of M
as follows,

$$W_N = g(M) = 2 \left( \frac{M}{3} \right)^{3/2}. \tag{23}$$

Increasing the total memory capacity to NM, we have

$$W_N^* = 2 \left( \frac{NM}{3} \right)^{3/2} = N^{3/2} \cdot 2 \left( \frac{M}{3} \right)^{3/2} = N^{3/2} W_N = \bar{g}(N) W_N. \tag{24}$$

The matrix multiplication problem has a semihomomorphism between its memory requirement and
computation requirement and $\bar{g}(N) = N^{3/2}$. Assuming a negligible communication cost, the global
computation approach will achieve the best possible scaled speedup of the matrix multiplication
problem.

Figure 7. Matrix multiplication without data replication.

We have studied two extreme cases of memory-bounded scaled speedup which are based on
global computation and local computation. In general for most of the algorithms, part of the data
may be replicated and part of the data may have to be shared. Deriving a speedup formulation for
these algorithms is difficult, not only because we are facing a more complicated situation, but also 
because the ratio between replicated and shared data is uncertain. The replicated part may not 
increase as the system size is increased. In case the replicated part does increase, its speed of increase 
may be different from the speed that the shared part is increased. Also, an algorithm may start with 
global computation. When the system size is increased, replication may be needed as part of the 
effort to reduce communication overhead. A special combined case, G(N ) = (1 + g[l - 21^1]) AT, 
has been carefully studied in [15]. The structure of that study can be used as a guideline for other 
algorithms. 

The influence of communication overhead on the best performance of the memory-bounded
speedup is studied. The study can be extended to fixed-time speedup, where redundant
computation could be introduced to reduce the communication overhead. The function G(N) determines
the actual achieved speedup. We have shown how the partition and scale of the problem will
influence the function G(N). In general, finding an optimal function G(N) is a non-linear optimization
problem. The concept of the function G(N) can be extended to algorithms with multi-degree of
parallelism.

6 Conclusions 

It is known that the performance of parallel processing is influenced by the inherent parallelism
and communication requirement of the algorithm, by the computation and communication power
of the underlying architecture, and by the memory capacity of the parallel computer system.
However, how these factors are related to each other and how they influence the performance of
parallel processing are generally unknown. Discovering the answers to these unknowns is important
for designing efficient parallel algorithms. In this paper one model of speedup, memory-bounded
speedup, is carefully studied. The model contains these factors as its parameters.

As part of the study on performance, two other models of speedup have also been studied. 
They are fixed-size speedup and fixed-time speedup. Two sets of speedup formulations have been 
derived for these two models of speedup and for memory-bounded speedup. Formulations in the 
first set give rise to generalized speedup formulas. The second set of formulations only considers a 
special, simplified case. The simplified fixed-size speedup is Amdahl’s law, the simplified fixed-time 
speedup is Gustafson’s scaled speedup, and the simplified memory-bounded speedup contains both 
Amdahl’s law and Gustafson’s scaled speedup as special cases. 

The three models of speedup, fixed-size speedup, fixed-time speedup and memory-bounded 
speedup, are based on different viewpoints and are suitable for different classes of algorithms. However, 
algorithms exist which do not fit any of the models of speedup, but satisfy some combination 
of the models. 


Appendix 

When communication does not occur (local computation) or its cost is negligible, the 
memory-bounded speedup equation (21) becomes 

$$S^*_N = \frac{W_1 + G(N)\,W_N}{W_1 + \frac{G(N)}{N}\,W_N}. \tag{26}$$
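
As a minimal sketch, the simplified speedup (W_1 + G(N) W_N) / (W_1 + G(N) W_N / N) can be evaluated directly. The function name and the workload values below are illustrative assumptions, not taken from the paper. Setting G(N) = 1 gives Amdahl's law and G(N) = N gives Gustafson's scaled speedup, which are the two special cases named in the text.

```python
def memory_bounded_speedup(w1, wn, g, n_procs):
    """Simplified memory-bounded speedup with communication cost ignored.

    w1: sequential work, wn: perfectly parallel work,
    g: value of G(N), n_procs: number of processors N.
    """
    return (w1 + g * wn) / (w1 + g * wn / n_procs)

# Hypothetical workload: 10% sequential, 90% parallel, 16 processors.
w1, wn, n_procs = 0.1, 0.9, 16

amdahl = memory_bounded_speedup(w1, wn, 1.0, n_procs)             # G(N) = 1
gustafson = memory_bounded_speedup(w1, wn, n_procs, n_procs)      # G(N) = N
matmul = memory_bounded_speedup(w1, wn, n_procs ** 1.5, n_procs)  # G(N) = N^(3/2)

print(amdahl, gustafson, matmul)
```

For a fixed workload split, the three values increase with G(N), in line with the monotonicity claim that follows.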

It is easy to verify that $S^*_N$ increases with the function G(N). Thus, for the two extreme cases 
considered in Section 5, the problem of how to reach the maximum speedup becomes how to scale 
the matrices A and B such that the function G(N) reaches its maximum value. The matrices A and 
B can be scaled in any dimension. A general scaled matrix multiplication problem is 

$$A_{l*k}\,B_{k*m} = C_{l*m},$$

where both A and B are rectangular matrices. To achieve an optimal speedup, we need to decide 
the integers l, k, and m for which the function G(N) reaches its maximum value. The 
following result gives the optimal l, k, and m for the global computation approach (Fig. 6(b)) 
given in Section 5. Recall that N is the number of processors. 

Proposition 1 If A and B are n × n matrices when N = 1, then the global computation approach 
reaches the maximum G(N) when $l = k = m = n\sqrt{N}$, excluding the communication cost. 
The corresponding G(N) equals $N^{3/2}$, and the maximum speedup is 

$$S^*_N = \frac{W_1 + N^{3/2}\,W_N}{W_1 + N^{1/2}\,W_N}.$$

Proof: By the partition schema of the global computation approach, the rows of matrix A 
and the columns of matrix B are distributed among processors. The workload on each processor is 

$$A_{\frac{l}{N}*k}\,B_{k*\frac{m}{N}} = C_{\frac{l}{N}*\frac{m}{N}}.$$

Since the memory is fully filled, 

$$\frac{l}{N} * k + k * \frac{m}{N} + \frac{l}{N} * m = 3n^2.$$

Thus, 

$$k = \frac{3n^2 - \frac{l}{N} * m}{\frac{l}{N} + \frac{m}{N}} = \frac{3n^2 N - l * m}{l + m}. \tag{27}$$

The work of the scaled problem is 

$$W^*_N = 2 * l * m * k = 2 * l * m * \frac{3n^2 N - l * m}{l + m} = 2n^3\,\frac{l * m}{n^3}\,\frac{3n^2 N - l * m}{l + m} = \frac{(3n^2 N - l * m)(l * m)}{(l + m) * n^3}\,W_N,$$

and 

$$G(N) = \frac{(3n^2 N - l * m)(l * m)}{(l + m) * n^3}. \tag{28}$$

Therefore, G(N) reaches its maximum value if and only if the function 

$$f(l, m) = \frac{(3n^2 N - l * m)(l * m)}{l + m}$$

reaches its maximum value. At its maximum value, the partial derivatives of f(l, m) vanish. 
Clearing the positive denominator $(l + m)^2$, this gives 

$$-l^2 m^2 - 2 l m^3 + 3n^2 m^2 N = 0,$$

$$-l^2 m^2 - 2 m l^3 + 3n^2 l^2 N = 0.$$

It leads to 

$$l^2 + 2lm - 3n^2 N = 0, \qquad m^2 + 2lm - 3n^2 N = 0. \tag{29}$$

This is 

$$(l + m)^2 = m^2 + 3n^2 N, \qquad (m + l)^2 = l^2 + 3n^2 N. \tag{30}$$

Thus, we have $m^2 = l^2$, i.e. 

$$l = m. \tag{31}$$

Combining Eq. (31) and Eq. (29), we get 

$$l = m = n\sqrt{N}.$$


From Eq. (27), we have $k = n\sqrt{N}$. Thus, the enlarged A and B are still square matrices, with 
dimension $n\sqrt{N}$. By Eq. (28) the maximum G(N) is 

$$G(N) = \frac{(n\sqrt{N})^2\,\bigl(3n^2 N - (n\sqrt{N})^2\bigr)}{n^3\,(2n\sqrt{N})} = N^{3/2},$$

which is equal to the memory-work function g(N) for the matrix multiplication problem (see Section 
5), and the corresponding speedup is 

$$S^*_N = \frac{W_1 + N^{3/2}\,W_N}{W_1 + N^{1/2}\,W_N}.$$

From Theorem 1, it is the best possible performance for the matrix multiplication problem. 
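
The optimum claimed in Proposition 1 can be checked numerically. The sketch below assumes the forms k = (3n²N − l·m)/(l + m) for Eq. (27) and G(N) = (3n²N − l·m)(l·m)/((l + m)·n³) for Eq. (28), and searches an integer grid for the pair (l, m) that maximizes G(N). The function names and the values n = 8, N = 16 are illustrative choices, picked so that n√N is an integer.

```python
import math

def k_from_memory(l, m, n, N):
    """Eq. (27): the inner dimension k fixed by the full-memory condition."""
    return (3 * n * n * N - l * m) / (l + m)

def scale_factor(l, m, n, N):
    """Eq. (28): G(N) expressed through the outer dimensions l and m."""
    return (3 * n * n * N - l * m) * (l * m) / ((l + m) * n ** 3)

n, N = 8, 16                      # illustrative sizes; n * sqrt(N) = 32
side = round(n * math.sqrt(N))    # predicted optimal dimension
search = range(1, 3 * side + 1)   # integer grid around the prediction

# Exhaustive search for the (G, l, m) triple with the largest G.
best = max((scale_factor(l, m, n, N), l, m) for l in search for m in search)

print(best)                                    # best G with its l and m
print(k_from_memory(best[1], best[2], n, N))   # matching k from Eq. (27)
print(N ** 1.5)                                # predicted maximum G(N)
```

The search lands on l = m = 32 with k = 32 and G(N) = 64 = 16^{3/2}, matching the closed-form result for these sizes.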


Using similar arguments as in Proposition 1, we can find that the optimal dimension of the local 
computation approach is l = k = n, m = nN, and the maximum value of G(N) is N (see Section 
5). The scalability of matrices A and B is application dependent. If A and B should be maintained as 
square matrices, the following proposition shows the limitation of the local computation approach. 
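
A quick consistency check of the stated local-computation optimum is sketched below. It assumes a per-processor memory model in which each processor holds all of A, plus m/N columns of B and of C; this model is inferred from the square case treated in the proof of Proposition 2 and is not stated explicitly for the rectangular case. Function names are hypothetical.

```python
def local_memory_per_processor(l, k, m, N):
    """Assumed model: full A (l*k), plus m/N columns of B and of C."""
    cols = m / N
    return l * k + k * cols + l * cols

def work_ratio(l, k, m, n):
    """Scaled work 2*l*k*m relative to the original work 2*n^3."""
    return (2 * l * k * m) / (2 * n ** 3)

n = 8  # illustrative base dimension
rows = []
for N in (1, 4, 16, 64):
    l, k, m = n, n, n * N  # stated optimum: l = k = n, m = nN
    rows.append((N, local_memory_per_processor(l, k, m, N), work_ratio(l, k, m, n)))

for row in rows:
    print(row)  # (N, memory used per processor, G(N))
```

Under this assumed model, the memory used per processor stays at 3n² and the work ratio equals N for every N tried.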

Proposition 2 If A and B are n × n matrices when N = 1, and l = k = m is required, then the 
maximum value of G(N) of the local computation approach is $\left(\frac{3N}{N+2}\right)^{3/2}$, which is bounded by $3^{3/2}$ 
and is smaller than N, for N > 1. 

Proof: When A and B are square matrices, the scaled problem is 

$$A_{k*k}\,B_{k*k} = C_{k*k}.$$

If the load is balanced on each processor, and m = k/N is an integer, then each processor does the 
work 

$$A_{k*k}\,B_{k*m} = C_{k*m}.$$

When memory is fully used, 

$$k^2 + 2k * m = 3n^2.$$

Since m = k/N, 

$$k^2 + \frac{2k^2}{N} = 3n^2.$$

Thus, 

$$k = \sqrt{\frac{3n^2}{1 + \frac{2}{N}}} = n\sqrt{\frac{3N}{N+2}}.$$

The scaled work 

$$W^*_N = 2k^3 = 2n^3\,\frac{3N}{N+2}\sqrt{\frac{3N}{N+2}} = \frac{3N}{N+2}\sqrt{\frac{3N}{N+2}}\;W_N,$$

and 

$$G(N) = \frac{3N}{N+2}\sqrt{\frac{3N}{N+2}}.$$

Since 

$$\frac{3N}{N+2} = \frac{3N+6}{N+2} - \frac{6}{N+2} = 3 - \frac{6}{N+2}$$

and 

$$\frac{3N}{N+2} = \frac{3}{N+2} \times N,$$

the G(N) is bounded by $3^{3/2}$ and is smaller than N, for N > 1. For the second claim, note that 
$G(N) = N\sqrt{27N/(N+2)^3}$, and $(N+2)^3 > 27N$ for N > 1 by the arithmetic-geometric mean 
inequality applied to N, 1, and 1. 
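
The bound in Proposition 2 can be tabulated with a few lines of code. This is an illustrative sketch that assumes the closed form G(N) = (3N/(N+2))^{3/2} from the proposition; the function name and the sampled processor counts are arbitrary choices.

```python
def square_local_scale_factor(N):
    """G(N) for the local computation approach with square matrices."""
    return (3 * N / (N + 2)) ** 1.5

limit = 3 ** 1.5  # upper bound approached as N grows

rows = [(N, square_local_scale_factor(N)) for N in (1, 2, 4, 16, 64, 1024)]
for N, g in rows:
    # Columns: N, G(N), and whether G(N) < N holds (False only at N = 1).
    print(N, round(g, 4), g < N)
```

The values start at G(1) = 1 and climb toward 3^{3/2} (about 5.2) without reaching it, so for large N the square-matrix local approach scales the work far less than the G(N) = N available when m alone is enlarged.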